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Abstract The problem of minimizing a convex function over an intersection 
of closed convex sets is ubiquitous in applied mathematics. It is particularly 
interesting when it is easy to project onto each separate set, but nontrivial to 
project onto their intersection. Algorithms based on Newton's method such 
as the interior point method are viable for small to medium-scale problems. 
However, modern applications in machine learning are posing problems with 
tens of thousands of parameters or more. We revisit this convex program- 
ming problem and propose an algorithm that scales well with dimensional- 
ity. Our proposal revolves around three ideas: the majorization-minimization 
(MM) principle, the classical penalty method for constrained optimization, and 
quasi-Newton acceleration of fixed-point algorithms. The performance of our 
distance majorization algorithms is demonstrated on some modern machine 
learning problems. 
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1 Introduction 

A wide spectrum of problems in applied mathematics and statistics can be 
formulated as an instance of the convex programming problem 

min £(x), (1) 

where £(x) is convex and the C, are closed convex sets in R p . At one extreme, 
problem ([lj includes classical least squares. At the other, it includes finding 
a feasible point in the intersection of several closed convex sets. In between, 
the formulation covers a variety of shape restricted regression problems such 
as fitting a support vector machine and projecting an exterior point onto a 
complicated convex set. Great progress has been made in attacking specific 
incarnations of problem ([I]). The projected gradient algorithm and its Newton 
and quasi-Newton extensions have been very successful when constraints are 
simple, for example box constraints, and admit a correspondingly simple pro- 
jection operator [4 , 22 . 33 , 35] . However, there is still room for improvement. 
In the current paper we present a unified approach to solving a smoothed 
relaxation of problem ([I]) via the majorization-minimization (MM) principle 
[23] . This approach is especially attractive when it is easy to project onto each 
separate set Ci but nontrivial to project onto their intersection. 

Problem |l]) can be written as the unconstrained optimization problem 

mm£(x)+^2S Cl (x), (2) 

i 

where the indicator function 8c(x) equals if x £ C and oo if not. Although 
problem ([2| is now unconstrained, the indicator functions render the objective 
function non-differentiable. This prompts us to replace 8c{x) by a smooth 
approximation dist(x, C) 2 = inf yS c \\x — j/Hf, where || • || 2 denotes the standard 
Euclidean norm. Further progress can be made by solving the related problem 

m 

min£(a;) + dist(a;,C7 4 ) 2 , (3) 

i=l 

where /i is a positive parameter that penalizes deviation from the original 
feasible region. The smooth approximation introduced in formulation ([3| is 
an example the quadratic penalty method [5l l29ll34] . Problem ^ has many 
appealing features. The problem is unconstrained with an objective func- 
tion that is convex and differentiable when £{x) is convex and differentiablc. 
Consequently optimality conditions can be readily identified. The distance 
function is closely tied to the projection Pc{x) of x onto C; specifically 
dist(x,C) = \\x - P c (x)\\ 2 , and Vdist(a;,C) 2 = 2[x - P c (x)]. Thus, a point 
x solves problem fcfy if and only if it satisfies the stationarity condition 



= V£(x) +nY,i x - p c, ( x )] ■ 

i 
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Of course finding such an x is often analytically intractable due to the pro- 
jection term. To solve ^ iteratively, we resort to the MM principle. Because 
we rely on majorizing dist(a;, C), we call our approach distance majorization. 
A key step in solving the subproblems will be calculating projection operators. 
Fortunately, many useful projection operators are easy to compute. The best 
known examples include projection onto: (a) a closed Euclidean ball, (b) a 
closed rectangle, (c) a hyperplane, (d) a closed halfspace, (e) a vector sub- 
space, (f) the set of positive semidefinite matrices, (g) the unit simplex, (h) a 
closed i\ ball, and (i) an isotone convex cone. While there are no analytic solu- 
tions for the last three projections, there are efficient algorithms for computing 

them nzumiiiiniiinT] . 

The rest of the paper is organized as follows. After a brief review of the 
MM principle, we illustrate the virtues of distance majorization in five different 
problem areas: (a) finding a point in the intersection of a finite collection of 
closed convex sets, (b) projection of a point onto the closest point in the 
intersection of a finite collection of closed convex sets, (c) convex regression, 
(d) classification via support vector machines, and (e) the facilities location 
problem. The literature on some of the examples is enormous, so we apologize 
in advance for omitting relevant references and slighting the ramifications of 
the various models. After our tour of examples, we present relevant convergence 
theory in a general algorithmic framework. Our concluding discussion indicates 
a few extensions and limitations of distance majorization. 

2 The MM Principle and Distance Majorization 

Although first articulated by the numerical analysts Ortega and Rheinboldt 
[30j . the MM principle currently enjoys its greatest vogue in computational 
statistics [3,25]. The basic idea is to convert a hard optimization problem 
(for example, non-differcntiable) into a sequence of simpler ones (for example, 
smooth). The MM principle requires majorizing the objective function f(y) by 
a surrogate function g(y \ x) anchored at the current point x. Majorization is 
a combination of the tangency condition g(x \ x) = f(x) and the domination 
condition g(y \ x) > f(y) for all y € R p . The iterates of the associated MM 
algorithm are defined by 

x n+ i := argmin g(y \ x n ). (4) 
v 

Because 

f{x n+ i) < g{x n+1 | x n ) < g(x„ | x n ) = f(x n ), (5) 

the MM iterates generate a descent algorithm driving the objective function 
downhill. Strict inequality usually prevails unless stationary point of 

f(x). 

The most useful majorization of dist(;c, C) follows immediately from the 
observations 

dist(x,C) < \\x - P c (x n )\\ 2l and dist(x n ,C) = \\x n - Pc{x n )h 
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Algorithm 1 Distance Majorization 

1: Given fj,g > and a starting point xq. 

2: k 4r- 

3: repeat 
4: y «- a; fc 
5: repeat 

6: for i = 1, . . . , m do 

7: Pi <- Pd (y) 

8: end for 

9: y <- argmin l(u) + fi k E™i 7i||« "Pilll 

10: until convergence 

11: Choose new penalty parameter fj-k+i > Mfc 

12: fc^fe + 1 

13: x k <- y 

14: until convergence 



for all pairs x and x n . In practice, majorizing d\st(x,C) 2 by ||x — Pc( ;c ri)||2 
leads to more convenient updates than majorizing dist(a;, C) by \\x— Pc{x n )\[2- 
Most of our applications can be phrased as minimizing the criterion 

m 

g(x | x n )=i{x) + ^^ li \\x-P Ci {x n )\\l, 

i=l 

for a convex loss £(x), a collection {Ci, . . . , C m } of closed convex sets, a pos- 
itive penalization parameter [i, and a corresponding set of convex multipliers 
7l,... ,7 m . Algorithm [l] shows the pseudocode for the distance majorization 
algorithm. 

We highlight the fact that the algorithm does not require the projection 
onto the intersection but rather only the projection onto each of the con- 
stituent sets Ci. As we will see in our first example, distance majorization 
can be considered a generalization of the simultaneous projection algorithm 
for finding a point in the intersection of a collection of closed convex sets. We 
note, however, that distance majorization is not unique in this regard. For 
comparison's sake, we will also present a dual ascent algorithm at the end of 
the next section that employs projections onto the constituent sets. Although 
the two methods exhibit comparable empirical performance, the distance ma- 
jorization algorithm is guaranteed to converge under weaker conditions than 
the dual ascent algorithm. 

3 Examples of Distance Majorization 

Finding a Feasible Point 

When the intersection C = C\" l =1 Cj is nonempty, majorization can be employed 
to locate a point in C . The general idea is to drive the convex combination 

m 

f(x) = ^7,dist( : E,C 4 ) 2 

i=l 



Distance Majorization and Its Applications 



5 



to 0. Minimization of the surrogate function 

m 

g(x | x n ) = ^ 7l ||a; - P Cz (x n )\\ 2 2 

i=l 

leads to the well-known simultaneous projection algorithm 

m 

x n +i = ^liPci(x n ). 

i=l 

The earliest version of this algorithm is attributed to Cimmino [12] • It does 
not necessarily find the closest point in C to x. The evidence suggests that si- 
multaneous projection converges more slowly than alternating projection jTUl 
ITS] . However, simultaneous projection enjoys the advantage of being paral- 
lelizable. One can invoke the theory of paracontractive operators to prove the 
convergence of both simultaneous and alternating projections [9]. 

The alternating projection algorithm can also be derived by distance ma- 
jorization. The least distance between two closed convex sets C\ and C2 can be 
found by minimimizing dist(a;, C2) 2 over x G Ci. If we majorize dist(a;, C2) by 
the surrogate function \\x — y n \\ 2 , where y n = Pc 2 (x n ), then the minimum of 
the surrogate occurs at Pc 1 (y n ) — Pc x [Pc 2 ( x n)]- When the two sets intersect, 
the least distance of is achieved at any point in the intersection. Our simple 
derivations of the simultaneous and alternating projection algorithms by the 
MM principle are likely novel. 



Projection onto the Intersection of Closed Convex Sets 

We next consider how distance majorization can be used to find the closest 
point in the intersection C to a point x. This involves minimizing the strictly 
convex function 

1 m 

i=l 

for fi large. The solution tends to the optimal point as \x tends to 00. 

The MM update for minimizing the surrogate function 

_^ TCI 

g^x | x n ) = -\\x-yf + ^^2j t \\x- P Cz {x n )\\ 2 

i=l 

is the convex combination 

^ m 
x n+i = ir— y+ r— y2jiP Ci ( x n)' (6) 

The corresponding algorithm map M^(x) is strictly contractive with contrac- 
tion constant c = ///(! + //). According to the contraction mapping theorem, 
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the iterates converge to the unique fixed point at geometric rate c. This fixed 
point coincides with the minimum point of the function f^ix). Indeed, fn(x) 
is diffcrentiable with gradient 

m 

S7fn(x) = x - y + |U^7i[x - P C A X )\- 

i=l 

Rearrangement of the stationarity condition V/u(a;) = gives the fixed point 
condition 

One can generalize these results in various ways. For instance, if we replace 
Euclidean loss by weighted Euclidean loss | J2i=i w i( x i ~ Ui) 2 \ then the MM 
update of the penalized loss has components 

m 

x n+ i,i = ■ — y. L H ■ — > ^ k P Ck {x n )i. 

Wi+fl Wi + [if- 1 

The quadratic penalty method suffers from roundoff errors and numerical in- 
stability for large \i. These arc mitigated in the MM algorithm since its up- 
dates ^ rely on stable projections and avoid matrix inversion. The slow rate 
+ fJt) of convergence for large /x is an issue. In practice one can improve 
the rate of convergence by starting fi small and gradually increasing it to its 
target value. For a fixed [i one can also accelerate the MM iterates by sys- 
tematic extrapolation. For instance, our quasi-Newton acceleration |39] often 
reduces the required number of iterations by one or two orders of magnitude. 

Projection as a Dual Program 

For the sake of comparison, we describe a dual algorithm for solving the pro- 
jection problem. This alternative algorithm can be accelerated by Nesterov's 
method 28,2 . The unaccelerated dual algorithm is a variation of Dykstra's 
algorithm [16], which solves the dual problem by block descent. 

To derive the dual problem, we first observe that the primal problem con- 
sists of minimizing 

m 

2 \\ x - y\\l + ^2 s c l ( x *) 

i=l 

subject to Xx = x, . . . , x m = x. The Lagrangian for the primal problem is 
C{x,x x , . . .,x m ,zi, . . .,z m ) = -\\x-y\\ 2 2 - (X^ 1 ) x + ^2\ 5 c l { x i) + z \ x i\- 

i=l i=l 
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If z = (zi, . . . , z m ) denotes the concatenation of the dual variables Zi, then 
the dual function 

V(z) = inf C(x,x 1 ,...,x m ,z 1 ,...,z m ) 

X.X\ ,....x m 

reduces to 

V{z) = -l\\s\\l - s f y - Y^L^wp^c.i-ztxi), 

where s = Y^iLi z i- The dual function can be maximized by the proximal 
gradient algorithm 

m 

z? +1 <^z? + [P Ct {x n -z?)-x n ]. 

For the sake of clarity, we have adopted novel notation in this derivation; 
x™ and zf denote the nth MM iterate of the ith primal and dual variables 
respectively. 

Derivation of this algorithm and its Nesterov acceleration (FISTA) appear 
in the Appendix. The dual updates, which are essentially projection steps, can 
be computed in parallel. Thus, the dual algorithm matches the MM algorithm 
in this regard. 

Projecting onto the set of doubly nonnegative matrices 

As a numerical example, consider the problem of projecting a symmetric ma- 
trix onto the set of doubly nonnegative matrices, namely the intersection of 
the set of nonnegative matrices with the set of positive semi-definite matrices. 
Many covariance matrices - for example, kinship matrices in statistical genet- 
ics - have nonnegative entries. Projection onto each of the component sets is 
relatively easy while projection onto the intersection is not. Projecting onto 
the set of nonnegative matrices is accomplished by setting all negative entries 
of a matrix to zero. Projecting onto the set of positive semi-definite matrices 
is accomplished by truncating the eigenvalue decomposition of the matrix and 
rejecting all outer products with negative eigenvalues. 

As a test case, we generated a 200-by-200 matrix with i.i.d. entries drawn 
from a standard normal distribution. After projecting the simulated matrix 
onto the space of symmetric matrices, we compared the distance majoriza- 
tion algorithm to its quasi-Newton acceleration (2 secants), the dual proximal 
gradient algorithm, and its FISTA acceleration. We implemented the MM al- 
gorithm with the geometrically increasing sequence /ii = 2 l — 1 of penalty 
constants u. The decision to switch to the next larger // was based on the ratio 
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50 100 150 200 250 300 



Iteration 

Fig. 1: A comparison of the MM algorithm, its quasi- Newton acceleration, the 
dual proximal gradient algorithm, and its FISTA acceleration applied to the 
problem of projecting a 200 x 200 matrix onto the set of doubly nonnegative 
matrices. 

Whenever this ratio fell below p = 10 -4 , we updated pi. To track the progress 
of each algorithm, we calculated two measures of constraint violation by the 
current matrix: (a) the absolute value of the most negative eigenvalue, and (b) 
the absolute value of the most negative entry. Figure [l] plots the maximum 
of the two constraint violations on a log scale at each iteration. The abrupt 
transitions in the MM and quasi-Newton MM paths reflect the switch points 
for the penalty constant fi. Obviously, the amount of work done in each iterate 
varies across the methods. For a more direct comparison, Table [T] records 
several statistics, including run times in seconds. In the table, the distance 
column conveys the Frobenius norm of the difference between the simulated 
matrix and the fitted matrix. The two featured algorithms perform about 
equally well. As expected, their accelerated versions do much better. 

Shape-Restricted Regression 

Isotone regression minimizes the least squares criterion ^Y^i=i w i(yi ~ x i) 2 
subject to the isotonic constraint x\ < ■•• < x n . This problem is readily 
amenable to the projection algorithm. Projection onto the isotone convex cone 

C = {x : x\ < xi < • • • < x n } 
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Table 1: Timing comparisons and constraint violations for projecting onto the 
set of doubly nonnegative matrices. 



Method 


Time (sec) 


Iterations 


Distance 


Constraint Violation 


MM 


16.526 


290 


120.9110 


-0.0048710012 


MM-QN 


11.098 


98 


120.9131 


-0.0007433297 


Dual 


19.882 


144 


120.9131 


-0.0009122053 


Dual (Acc.) 


13.926 


99 


120.9136 


-0.0009862162 




1.0 1.5 2.0 2.5 3.0 

X 



Fig. 2: Fitted data for isotonic regression. 

is rapidly accomplished by the pool adjacent violators algorithm [Tl l3"Tll37| ■ 
More complicated order restrictions such as Xi < Xj for all arcs in a 

directed graph can be handled as well. In this setting all components of a 
vector x projected on the convex are left untouched 

except components Xi and Xj, These are left untouched when xi < Xj. Both 
Xi and Xj are replaced by their average when Xi > Xj. 

We considered the problem of fitting a nondecreasing function to the data 
shown in Figure [2] (black dots). Each observed pair (xi,yi) was generated as 
follows. The equally spaced points between 1 and 3, and the yi satisfy 

U i — % ~t~ ^ i i 

where the q are i.i.d. standard normal deviates. For the MM algorithms we 
used the geometrically increasing sequence of penalty constants fii featured in 
the previous example and two secant conditions for the quasi-Newton accel- 
eration. We switched to the next value of fj, whenever the stopping criterion 
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5000 10000 15000 20000 

Iteration 



Fig. 3: A comparison of the MM algorithm, its quasi-Newton acceleration, 
the dual proximal gradient algorithm, and its FISTA acceleration applied to 
a univariate isotonic regression problem. 

(JTJ fell below p — 10~ 6 . A looser threshold p = resulted in unacceptably 
poor fits for these data. 

To track the progress of each algorithm, we measured the constraint viola- 
tion of an iterate as the maximum absolute constraint violation between two 
successive parameters. Figure [2] shows that all four algorithms return similar 
solutions under the specified stopping rule. Figure [3] plots the constraint vi- 
olation for each method on a log scale. Table [2] compares timing results and 
constraint violations at convergence. In the table the distance column conveys 
the Euclidean norm of the difference between observed points and fitted points. 
Compared to the previous example, we see an even greater improvement in 
the performance in the accelerated versions of the two algorithms. In general, 
it is safe to conclude that distance majorization is a viable alternative to its 
most likely fastest competitor in non-smooth convex optimization. 

Least Squares Fitting with Convex Functions 

Given responses yi, predictor vectors in W, and case weights lUj, convex 
regression seeks to minimize the sum of squares of residuals 

1 - 

2 z2 w i(yi -°i) 2 
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Table 2: Timing comparisons and constraint violations for the isotonic regres- 
sion example. 



Method 



Time (sec) Iterations Distance Constraint Violation 



MM 

MM-QN 
Dual 

Dual (Acc.) 



24.45 
3.27 

12.70 
5.46 



19651 
863 
6526 



9.633144 
9.677731 
9.637778 



2578 9.677847 



-1.330351e-02 
-4.869077e-05 
-1.525531e-02 
-4.104088e-05 



subject to the constraints $,\{xj — Xi) < 9j — 6i for every ordered pair 
8 . In effect, 9, is viewed as the value of the regression function 0(x) at the 
point Xi. The unknown vector € R p serves as a subgradient of 9(x) at a^. 
Because convexity is preserved by maxima, the formula 



6(x) = max 



3 



^(x-Xj 



defines a convex function with value 9i at x = Xi . In concave regression the op- 
posite constraint inequalities are imposed. Interpolation of predicted values in 
this model is accomplished by simply taking minima or maxima. Estimation re- 
duces to a positive semidefinite quadratic program involving n(jp + 1) variables 
and n(n — 1) inequality constraints. Note that the feasible region is nontrivial 
because it contains the point (0, 3) = (0, 0), where 3 = . . . , £ n ]. 
The penalized objective function is 

1 n 

U(e, s) = - £ 9 t f + f£ d% k (e, s), 

i=i jjtk 

where C jk = {(0, 3) : £{( Xj -x k ) < 9 3 -9 k }. Let P Cjk (9, S)< and P Cjk (e, 3) 1 
denote the components of Pc jk (0, 3) relevant to 9i and £ z , respectively. The 
surrogate function 

n n 

g „(e, 3 | e m , s m ) = - mivi - + f P* ~ p ^ s ^\\l 

i—l i—1 j^k 

n 



7, - Pc ]k ( a rn,3 m ) 1 

1 = 1 j^k 



admits the minimizer 

Wi 



'm+l,i — 



Wi+n(n—l)fj, 1 Wi + n(n—l)(i 



1 \ ^ ] PCj k (°m, a m)i 



i m+ u = Hn - I)]" 1 J2 p c Jk (0 mt 3, 

3^k 
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The projection operator Pc jk is easy to compute because Cjk is a half- 
space. Furthermore, if we define the quantities 



and rjk 



2 + || Xj - Xfc||| 



for j ^ k, 



then the sums entering the MM updates reduce to 

n n 



3^k 



k = l 



3=1 



Pc ]k (0, S) 1 = n(n - l)£ z - r jt (x s - x t ) 

j^k j=l 

evaluated at 6 = 8 m and 3 = S m . 

Figure [4] displays a randomly generated data set with 51 data points and 
the corresponding least squares fit with convexity constraints. We employed 
the same geometrically increasing sequence of fi used earlier, took five secant 
conditions for the quasi-Newton acceleration, and set the stopping criterion 
([T]) to p = 10~ 8 . The MM algorithm requires 8940 iterations and 4.12 seconds 
in total to achieve the objective value of 1.0709 and the maximal constraint 
violation at order of 7 x 10 -9 . 



Support Vector Machine 

Given data (yi,Xi), i — l,...,n, where j/j G {—1,1} and Xi G R p , the 
goal of discriminant analysis is to choose classification labels yi using the 
p-dimensional predictor x^. The support vector machine (SVM) [38] is one of 
the most popular classifiers and potentially benefits from distance penaliza- 
tion. Here the problem is to minimize the quadratic loss function 

" A 
/(0,6, e )=^6. i + -||0|| 2 

i=l 

subject to the inequality constraints 

l-y t (b + x\6) < a 

using slack variables £j > 0. See Example 15.5.2 of the book [53] for further 
details about problem formulation and passing to the dual. In the following we 
assume that the first element of Xi is 1, and thus the intercept b is absorbed 
in the parameter 8. Then the penalized objective function is 

n * n 

/, l ( e ,0) = 5> + -ll0||2 + |£4 j ( e ,0) 
i=i j=i 
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Fig. 4: Fitted data for convex regression. 



where Cj = {(e, 6) : ej + UjX^B > 1}. Minimizing the surrogate function 

n y n 

ff M (e,fl|6 m ,e m ) = 5^e i + ? ||fl||5 + ^||e-Pc7 J (6 m ,e ro ) e ||l 



^||0-P c .(e m ,0, 



subject to the non-negativity of ej yields the next iterate 



£m+l — — 

n 



Qm+i — y~ — Pcj ( e m, m )e- 



Because Cj is a half-space, 

-Pc, (em,0m) = f q 



1 - £ mj - 2/j£cj-^ 



-I + 
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where the vector has all entries equal to except for a 1 at entry j. 

We report the results on an example SVM problem with a training data 
set of n = 1371 observations and p — 7 features. We employed the same 
geometrically increasing sequence of \x and the same stopping criterion p used 
in the previous example. At A = 10, the MM algorithm takes 14,432 iterations 
and 2.69 seconds to achieve the objective value 489.0058 and the maximal 
constraint violation 8.6 x 10~ 9 . 

As a generalization, consider the kernel SVM [35] attractive in handling 
p » n problems. The optimization problem is to minimize 



2 = 1 



A 



subject to 



b + °3 K ( 2 



< e, and e, > for all i. 



Common choices of kernels include the polynomial kernel 



and the Gaussian kernel 



K(xi,Xj) = exp < - 



\Xi X 



J 112 



2a 2 



Since K is positive semi-definite, it can be expressed in terms of a Cholesky 
decomposition K = LL l . With reparameterization a = L f 0, the problem 
transforms to 



mm 

b,e,a 



i=l 



subject to 



l - y% 




< e, and e, > for all i, 



which is essentially the same as the original SVM. The Cholesky decomposi- 
tion costs n 3 /6 flops and might be a concern for data with huge number of 
observations. Some kernels used in genomics are naturally low rank with triv- 
ial Cholesky factors L and L*. Even for a full-rank kernel K, one can resort to 
the fast Lanczos algorithm (TJj to extract its top r eigen-pairs K w U r D r U* 

1/2 

and set L = U r D r , an n x r matrix. 
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The Fire Station Problem 

As a final example, we show that distance majorization need not be fettered to 
Euclidean distances. Indeed, Euclidean distances may be inappropriate in some 
problems. Consider the problem of determining the optimal location of a new 
fire station in a city where the streets occur on a rectangular grid. The station 
should be situated to guarantee the shortest routes to several major buildings 
spread throughout the city. This is just the generalized Heron problem with 
the l\ norm substituting for the Euclidean norm The projection operators 
Pc{x) are now harder to calculate. Indeed, they are often sets rather than 
points. Fortunately, when C is a rectangle [a, b] with sides parallel to the 
standard axes, Pq{x) is a point with components 

!<H Xi < at 
Xi di < Xi < bi 
h Xi > bi. 

To minimize the objective function, we minimize the surrogate function 

m p 

g(x | x n ) = ^2^2\xj - P^ t {x n )j\. 

»=1 3=1 

Because the t\ norm separates variables, we obtain a very simple update for- 
mula. 

x n +i,j = median [Pc^XnJj, ic m Wi] • 

Consider the example where the buildings have centers (—7, 0.5), (—5,-8), 
(4,7), (5,2), and (—4,6) and half-side lengths of 0.5. Minimizing the sum of 
t\ and li distances yields the results shown in Figure [5j The optimal position 
clearly depends on the underlying norm. For more general l\ problems, the 
solution may not be unique because the projection operator does not reduce 
to a single point. 



4 Convergence Analysis 

We now prove convergence of the distance majorization algorithm under con- 
ditions pertinent to Euclidean distances. Let us first consider the conver- 
gence of the MM algorithm for solving subproblem pi). The convergence 
theory of MM algorithms hinges on the properties of the algorithm map 
i[>(x) — argmin y g(y | x). For easy reference, we state a simple version of 
Meyer's monotone convergence theorem |26j instrumental in proving conver- 
gence in our setting. 
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(a) i\ norm (b) £2 norm 

Fig. 5: Optimal location for the fire station. 



Proposition 1 Let f{x) be a continuous function on a domain S and ip(x) 
be a continuous algorithm map from S into S satisfying f(ip(x)) < f{x) for 
all x £ S with ip(x) =/= x. Suppose for some initial point Xo that the set 

C f (x ) = {xeS : f(x) < f(x )} 

is compact. Then (a) lim m _ i . 00 ||a; m+1 — x m \\ = 0, (b) all cluster points are 
fixed points of tp(x), and (c) x m converges to one of the fixed points if they 
are finite in number. 

The function f(x) is the objective to be minimized. In our context, the ob- 
jective function is f^{x) = £(x) + ^ J_i=i 7« dist(x, C^) 2 . We make the fol- 
lowing assumptions: (a) i(x) is continuously differentiable and £(x) + % \\x\\ 2 
is convex for some constant k > 0, (b) f K (x) is coercive in the sense that 
linijj3.jj_j.00 f K (x) = 00, and (c) // > k. Note that f^{z) inherits coerciveness 
from f K (z). Assumption ([b]) is met in several different scenarios, for example, 
if at least one of the Ci is bounded or if £(x) itself is coercive. 

Proposition 2 The cluster points of the MM iterates for solving subproblem 
|5J) are stationary points of the objective function under assumptions |a|) 
through above. If the number of stationary points is finite, then the MM 
iterates converge. Finally, if f^(x) has a unique stationary point, then the MM 
iterates converge to that stationary point, which globally minimizes fu(x). 

Proof We first argue that the surrogate function g^ (y \ x) is strongly convex, 
a crucial fact invoked later. For all x,y, and z, Assumption (Ja| implies 



yf > £(z) + f \\z || 2 + [W(z) + nz]\y - z), 
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which in turn entails 

e(y)>i(z) + ve(z) t ( y -z) + K 



! ■ ■ \\\zf + z\y-z)- l -\\y\\ 



(8) 



(9) 



= i(z)+V£(zy(y-z)- %\\x-y\\ 

The quadratic expansion 

\\y - P Ci (x)\\ 2 = \\y - z + z - P Ci (x)\\ 2 

= \\z- PcMf + 2[* - Pc,{x)f{y - z) + \\y - z\ 
also holds. Combining inequality ^ with equality ([9| leads to 

9»{y I x) > g^z | x) + Vg^z \ x)\y - z) + ^—^\\z - y\\ 2 2 . 



which is equivalent to the strong convexity of y i— >• g^(y \ x). In view of this 
result, y \— > g^(y \ x) has a single stationary point, which is also its unique 
global minimizer. 

We now proceed to check the conditions given in Proposition [I] It is easy 
to verify that f^{x) is continuous. We must also show that ip(x) is continuous. 
Take an arbitrary convergent sequence x n that tends to the limit x. It suffices 
to prove that the sequence y n — ip(x n ) tends to y = ip(x). Now there exists 
a constant b such that 

U(x n ) < f^x) + b 

for all x n . In view of the descent property, we have fn(y n ) < f^{x) + b as 
well. Hence, coerciveness implies y n is bounded. Consider any convergent sub- 
sequence y nk with limit z. The points y nk and related through the 
stationarity condition 

n i 

= V%„ fc ) + a» 1* iVn„ - P C ( X n k )] ■ 
1=1 

Since £{x) is continuously differentiable and Euclidean projections are contin- 
uous functions, taking limits gives, 

m 

= W(*)+p£7i[*-P c ,(a:)] = V 5m (* | x). 
i=i 

Because the surrogate function z 4 ^(z | x) possesses a unique stationary 
point y, the subsequence y nk converges to y. Given this conclusion for all 
subsequences of the bounded sequence y n , the sequence y n in fact converges 
to y. 

The strict descent property of ip(x) follows from the uniqueness of the 
global minimizer of g^(y \ x). Because ffi(x) is coercive and continuous (in 
fact, continuously differentiable), the set Cf (xq) is compact for any initial 
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point Xq. Therefore, Proposition [T] implies that all cluster points of the se- 
quence x n+ i = ip(x n ) are fixed points. Since X7f^(x) — \7g^{x \ x), fixed 
points coincide with stationary points of }^{x). If f^{x) has finitely many 
stationary points, conclusion ([c]) of Proposition [I] implies that the iterates 
converge to one of the stationary points. If the coercive function /^(x) pos- 
sesses a single stationary point, then that point represents a global minimum, 
and the MM iterates x n converge to it. 

Observe that Proposition[2]does not explicitly require the loss function £{x) 
to be convex. This is in sharp contrast to the strong convexity condition on 
£(x) needed to ensure the global convergence of the dual ascent algorithm. The 
convergence of the dual ascent method is discussed further in the Appendix. 
For a sequence of penalization parameters /if. t 00 : we intuitively expect the 
solutions to the penalized problems to approach a solution to the original 
problem. Indeed, this is the case. We restate Theorem 17.1 in j29j in our 
notation. 

Proposition 3 Suppose each x(fik) exactly solves subproblem Q), and that 
fj, k t 00. Then every cluster point of the sequence x(fj>k) is a global solution to 
the original problem 

When £(x) is coercive and possesses a unique minimizer subject to the con- 
straints, one can justify the stronger claim that the sequence x(fik) converges 
to the minimizer. Under these assumptions the sequence x(jj,k) is bounded and 
possesses exactly one cluster point. Therefore, the sequence x([ik) converges 
to that cluster point. Boundedness of x(/j,k) follows from the inequalities 

£[x(n)} < fMl*)] < Mv) = *(V) 
for any feasible point y. 

5 Discussion 

The MM principle is a versatile tool. Here we demonstrate how majorizing a 
distance function can be leveraged to solve a variety of optimization problems 
with non-trivial convex constraints. The resulting MM algorithms have sim- 
ple update formulas that open the door to straightforward parallelization and 
graceful handling of large data sets. In the case of projection onto an intersec- 
tion of closed convex sets, we have demonstrated that accelerated variants of 
the MM algorithm are competitive with the current state-of-the-art algorithms 
for solving non-smooth convex programs. 

Several of our examples rely on the classical penalty method. This raises 
the questions of how to select the ultimate penalty constant and how fast to 
increase it from a low starting value. The quality of our solutions and the 
rate of convergence of the MM algorithms depend on these choices. We have 
given some rough guidelines that work well in practice, but more theoretical 
and empirical insight would be helpful. We have not encountered disastrous 
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numerical instabilities in using the penalty method, partially because all of 
our computations were carried out in double precision. 

Distance majorization works best for Euclidean distance. This follows from 
the fact that explicit formulas are available for several important projection op- 
erators. For others, such as projection onto the unit simplex, fast algorithms 
have been devised. Nonetheless, as the fire station example shows, distance 
majorization can be applied to non-Euclidean distances. Is it possible to de- 
vise fast MM algorithms for computing non-Euclidean distances? This is an 
problem area deserving more thorough study. 

Another intriguing issue is the application of distance majorization to min- 
imization of non-convex loss functions £(x) over the intersection of convex sets. 
In statistics, many useful robust parameter estimates employ non-convex £(x), 
for example Tukey's biweight function and more generally M-estimators |20) . 
Although the strongest convergence guarantees require the uniqueness of a 
global solution, much of the convergence theory remains intact if convexity 
no longer assumed. Our convergence theory shows that the convexity assump- 
tion on £(x) can be relaxed. Extending these results and working practical 
examples are worthy targets of future research. 
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Appendix 

We derive a modest generalization of an iterative algorithm for the dual of the 
projection problem [T4]. Because constructing the dual program and the asso- 
ciated projected gradient algorithm are exercises in modern convex analysis, 
we first review a few key facts from this discipline. Readers can consult the 
references [BU7., 19,32,34] for proofs and further background material. 
The Fenchel conjugate f*(y) of a function f(x) is defined as 

f*(y) = sup [y l x - f{x)] . 

X 

When f(x) is convex and lower semicontinuous, it satisfies the biconjugate 
relation f**(x) = f(x). In particular, the conjugate of the indicator function 
5c (x) of a set C is the support function 

S* c {y) = sup y l x 
xec 

of C. When C is closed and convex, 8 c *{x) — Sc(x). 

Recall that a function f(x) is strongly convex with parameter r\ > if 
the difference f(x) — %\\x\\% is convex. Thus, a strongly convex function has a 
curvature bounded away from zero. If f(x) is strongly convex, then the value 
f*(y) is attained at a single point x. In this case, f*(y) is differentiable with 
gradient V/*(y) = x. Furthermore, V/*(y) satisfies the Lipschitz inequality 

||V/*(z)-V/*(y)|| 2 < -||*-»|| a . 

Lipschitz continuity ensures global convergence of the proximal gradient algo- 
rithm for solving the dual problem. The proximity-operator prox ?l (;z) associ- 
ated with a function h(x) is defined as 



proxjj(z) = argmin 



h{x) + -\\z-x\ 



Here the right hand side has a unique minimizer whenever h(x) is convex 
and lower semicontinuous. The proximal gradient method |28] is guaranteed 
to minimize the function f(x) + g(x) when f(x) is differentiable, convex, and 
has a Lipschitz continuous gradient, and g(x) is lower-semicontinuous and 
convex. The proximal gradient method iterates according to 

x n+1 = prox CT9 [x n - aVf(x n )} , 

where a denotes a step size and x n the nth iterate. We recover the classic 
gradient descent method when g(x) is the zero function, and we recover the 
projected gradient algorithm when g(x) — &c{ x ) is the indicator of a closed 
convex set C. Thus, the proximal gradient algorithm generalizes two important 
algorithm classes. 

We are now ready to derive an iterative algorithm for solving the dual 
program of interest. Consider the slightly more general problem of minimizing 
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a strongly convex function f(x) over the intersection of a finite collection of 
closed convex sets C\ , . . . , C m . This problem can be reposed as minimizing the 
function 

m 

t=l 

subject to the constraints x\ = x, . . . , x m = x. The Lagrangian for this prob- 
lem 

m . m 

L(X,X 1 , ...,X m ,Zl,.. .,Z m ) = f(x) - (y^-gj) X + ^ [ S Ci(Xi) + z\xi] . 

gives rise to the dual function 

m 

V{z) = -f{z x + ■■■ + z m ) - hi(zi), 

i=l 



where z = (z±, . . . , z m ) denotes the concatenation of the dual variables Zi and 
hiizi) is the support function of the set Ci at —Zi. Thus, the dual problem of 
maximizing T>(z) is equivalent to minimizing f*(zi + • • • + z m ) + hi(zi). 
Given that f(x) is strongly convex, f*(z) is differcntiable and in fact V f*{z) 
is Lipschitz continuous. Therefore, the dual is a prime candidate to be solved 
via the proximal gradient method. Since hi{ z i) separates the variable Zi 1 
the dual proximal gradient step can be computed blockwise as 

z? +1 = prox CT/l! [z? - aVr(z? + ■■■ + z n m )\ , (10) 

where a denotes a step size. The algorithm simplifies further by applying the 
Moreau decomposition [T31 Lemma 2.10]. 

u = prox CT/l . (m) + aprox h * /(7 (u/a). (11) 
Note that prox h * /^(u/a) = —Pc\(~u/a) and Vf*(s) is a minimizer of the 



convex function f(x) — s*x. Combining these identities with (10 1 and (11) 
gives the algorithm 

x n = argmin [f(x) - (z? H h z n m ) t x\ 

X 

z^ +1 =z- + a[P Cl {x n -o- l z'l)-x n }. 



Convergence is assured by setting the step length a = r/, where I/77 is the 
Lipschitz constant of V f*{z). Thus, the strong convexity condition on f(x) is 
actually required for convergence, since a closed, convex function / is Lipschitz 
continuous if and only if its conjugate function is strongly convex [21 j . 
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The Nestcrov acceleration mentioned earlier requires just a minor adjust- 
ment. The first two iterates are computed as above; subsequent updates use 
the following extrapolation steps. 

x n = argmin [f(x) -(*» + ... + 



77 — 9 



s n-lj 

v n+l n , rn ( ~* n +~ ^- o n \ 
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